Simultaneous Gene Clustering and Subset Selection for Sample Classification via MDL
Abstract
Motivation: Microarray technology allows for the simultaneous monitoring of thousands of gene expressions for each sample. The high-dimensional gene expression data can be used to study similarities of the genes' expression profiles across different samples, that is, gene clustering. The gene clusters may be indicative of genetic pathways. Another important application is sample classification, based on all or selected gene expressions. Gene clustering and sample classification are often undertaken separately, or in a directional manner (one as an aid for the other). However, such separation of the two tasks may occlude informative structure in the data. Here we present an algorithm for the simultaneous clustering of genes and subset selection of gene clusters for sample classification. We develop a new model selection criterion based on Rissanen's MDL (minimum description length) principle. For the first time, an MDL code length is given for both explanatory variables (genes) and response variables (sample class labels).

Results: We apply our algorithm for simultaneous gene clustering and subset selection for classification to 3 data sets: (1) the NCI60 data set of Ross et al. (2000), (2) the Acute Leukemia data set of Golub et al. (1999), and (3) the Colon tumor vs. normal data set of Alon et al. (1999). Competitive or improved test error rates, compared with the best reported methods, are demonstrated on all 3 data sets. Furthermore, we show that the model selection instability associated with single gene selection is much reduced by the selection of gene clusters. We comment on how the inclusion of a classification criterion affects the gene clustering, bringing out class-informative structure in the data.

Availability: The methods presented in this paper have been implemented in the R language. The source code is available from the first author.

Contact: [email protected], [email protected]

INTRODUCTION

Gene clustering identifies groups of genes that exhibit similar expression profiles across samples. Examples of gene clustering methods that have been used are hierarchical clustering, k-means, partitioning around medoids (PAM), and the SVD-based gene shaving of Hastie et al. (2000a). For sample classification, we build predictive models for the sample classes based on all or selected gene expressions. Many classification methods have been applied to gene expression data: some traditional methods such as discriminant analyses (DA), nearest neighbors (NN), and classification and regression trees (CART); some novel and more complex methods such as support vector machines (SVM), boosting and bagging CART, and neural nets. On most data sets, the simple and complex methods perform nearly equally well, or poorly. In fact, a recent study by Dudoit et al. (2000) indicates that simple NN, or diagonal linear DA (DLDA), often results in the best test error rate performance. Gene expression data sets consist of few samples on which to build and validate models. It is possible that the use of more complex classifiers will be more justified and necessary as the number and size of data sets grow. Many classifiers show improved performance with variable or feature selection. In addition, parsimonious classification models are often easier to interpret. This motivates gene or variable subset selection. Gene clustering and sample classification with variable selection are in most cases treated as separate problems, or in a directional manner (one as an aid for the other).
As an example of the first, Hong and Li (2001) cluster genes and use cluster representatives with "soft-max" sample classification, without variable selection. As an example of the latter, the supervised gene shaving of Hastie et al. (2000a) uses the sample class information for improved gene clustering. Another example of a directional approach is the supervised harvesting of expression trees of Hastie et al. (2000b). There, initial clustering is followed by the selection of cluster centroids for classification, but the clustering remains fixed. Here, we view the clustering of genes, and the subsequent selection of clusters that discriminate between samples, as one model selection problem. We use Rissanen's minimum description length (MDL) principle, which formalizes Occam's razor. MDL stipulates that one should choose the model with the shortest description, or code length, of the data. Most of the time, the code lengths take the form of a penalized likelihood, where the penalty reflects the model complexity. By turning classification (linear discriminant analysis) into regression via optimal scoring, we form a combined code length for the gene clustering and sample classification. We use a two-part code, where the first part describes the gene clustering based on a Gaussian mixture model, and the second part describes the sample class labels conditioned on the gene clusters. We select the model that minimizes the sum of the two parts of the code length. Thus, the clustering is affected by the ability of the corresponding cluster centers to discriminate between the sample classes. Similarly, the selection of clusters for classification is affected by the clustering. For the first time, an MDL model selection criterion is given for both explanatory variables (genes) and response variables (class labels). Competitive or improved test error rates, compared with the best reported methods, are shown on the National Cancer Institute's anti-cancer drug-screen data (NCI60) of Ross et al. (2000), the Acute Leukemia data set of Golub et al. (1999), and the Colon data set of Alon et al. (1999). We also show that the model selection instability is reduced by selecting gene clusters rather than individual genes. The simultaneous clustering and subset selection for classification results in gene clusters with class-informative structure.

The paper is organized as follows. In the Methods section we briefly outline the MDL principle, optimal scoring, and the derivation of our simultaneous clustering and subset selection MDL criterion. Test error rates and model selection results are presented in the Results section for the 3 gene expression data sets. We conclude with a discussion of our method and ideas for extensions.

METHODS

Rissanen's minimum description length principle formalizes Occam's razor, and states that one should choose the model that gives the shortest description, or code length, of the data. In the MDL framework, a model is meant to aid in the data coding or transmission. The basic idea is that although a complex model reduces the data coding cost given the model, the total code length required to describe the model and the data may be smaller if we use a simple model. MDL model selection thus trades off model complexity against model goodness-of-fit, based on the model's data coding efficiency. There are many ways of constructing code lengths for a data string $x^n$. Here we restrict the discussion to parametric models, indexed by parameter vectors $\theta$.
A code length based on a mixture distribution over the members of the model family defined by $\theta$ is called a mixture code length, and is given by $\mathrm{CL}(x^n) = -\log \int f(x^n\mid\theta)\,w(\theta)\,d\theta$. We have a closed-form expression for the mixture code length if $f(x^n\mid\theta)$ is an exponential family and $w(\theta)$ is a corresponding conjugate prior. Comparing two models using the mixture code length is effectively looking at Bayes factors. To first order, we can approximate the mixture code length by a two-stage code: we first encode the estimated model parameters $\hat\theta$ at precision $1/\sqrt{n}$, and then the data given the estimated model. The two-stage code length is given by $-\log f(x^n\mid\hat\theta) - \dim(\theta)\log(1/\sqrt{n}) = -\log f(x^n\mid\hat\theta) + \dim(\theta)\,(\log n)/2$, which we recognize as BIC. Alternatively, via a Laplace expansion of $\mathrm{CL}(x^n)$ we get Rissanen's Stochastic Information Complexity, $\mathrm{SIC}(x^n) = -\log f(x^n\mid\hat\theta) + \frac{1}{2}\log|n I(\hat\theta)|$, with $I(\hat\theta)$ equal to the Fisher information matrix. SIC can be interpreted as a two-stage code with a parameter-value-dependent precision $\mathrm{SE}(\hat\theta_{\mathrm{mle},j}) = 1/\sqrt{n I_{jj}(\theta)}$.

Using coding arguments as a safeguard against overfitting in clustering was first introduced in 1968 by Wallace and Boulton. Their Minimum Message Length (MML) information measure is interpreted from a strictly Bayesian point of view. The MDL and MML concepts are similar, and the differences are philosophical rather than technical. The commonly used MDL (or MML) approach to clustering selects the number of clusters $K$ in a data set using an SIC two-stage code (Wallace and Boulton (1994)). At the first stage, we describe the $K$ cluster models, the cluster sizes, and the cluster memberships. At the second stage, we encode for each datum its deviance from its cluster model. For cluster $k \in \{1,\dots,K\}$, the $k$th cluster model is defined by 3 parameters $\theta_k$: the centroid $\mu_k$, the standard deviation $\sigma_k$ of independent additive mean-zero Gaussian errors, and the cluster proportion $p_k$ (mixture probability). Uninformative priors $w$ are assigned to all parameters. The SIC approximation of the MML code length is $\mathrm{MML} = -\log w(\theta) + \frac{1}{2}\log|I(\theta)| - \log f(x^n\mid\theta)$, where $f(x^n\mid\theta)$ is a Gaussian mixture likelihood and $w(\theta)$ the prior. We have $\frac{1}{2}\log|I(\theta)| = \sum_{k=1}^{K}\log\big(\sqrt{2}\,n_k/\sigma_k^2\big)$, where $n_k$ is the number of observations in cluster $k$. The deviances from the cluster models are $r_i = \sum_{k=1}^{K} 1\{i \in b(k)\}\,(x_i - \mu_k)$, $i = 1,\dots,n$, where $b(k)$ denotes the membership set of cluster $k$. If $b(k)$ is empty, or the cluster variance is 0 ($\hat\sigma_k^2 = 0$), no additional code is needed for $r_i$. The MML code length is given by
$$\mathrm{MML}(K) = K\log(2\sigma^2) - \log(K-1)! + \sum_{k=1}^{K}\log\frac{\sqrt{2}\,n_k}{\hat\sigma_k^2} - \sum_{i=1}^{n}\sum_{k=1}^{K} 1\{i \in b(k),\,|b(k)|>0,\,\hat\sigma_k^2>0\}\,\log\varphi(r_i\mid 0,\hat\sigma_k^2),$$
and we select the $K$ that minimizes $\mathrm{MML}(K)$. The extension to $D$-dimensional data is straightforward.
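To make the two-stage clustering code length concrete, the following R sketch (illustrative only, not the authors' implementation) scores a K-cluster model of a data matrix with a BIC/SIC-style parameter cost plus the Gaussian residual cost, and selects the K with the smallest code length. A spherical per-cluster Gaussian is assumed, and k-means supplies the cluster assignments.

```r
## Two-stage code length for a K-cluster model of the rows of x:
## first stage encodes centroids, per-cluster variances and cluster proportions
## at a BIC-style cost; second stage encodes each datum's deviance from its
## cluster model. A minimal sketch assuming spherical Gaussian clusters.
cluster_code_length <- function(x, K, fit = kmeans(x, centers = K, nstart = 25)) {
  x <- as.matrix(x)
  n <- nrow(x); d <- ncol(x)
  second_stage <- 0
  for (k in seq_len(K)) {
    idx <- which(fit$cluster == k)
    if (length(idx) < 2) next                  # empty or singleton cluster: no residual code
    r  <- sweep(x[idx, , drop = FALSE], 2, fit$centers[k, ])
    s2 <- mean(r^2)
    if (s2 <= 0) next                          # zero-variance cluster: no residual code
    second_stage <- second_stage - sum(dnorm(r, 0, sqrt(s2), log = TRUE))
  }
  n_par <- K * d + K + (K - 1)                 # centroids, variances, proportions
  0.5 * n_par * log(n) + second_stage          # parameter cost + residual cost
}

## Example: pick the number of clusters K minimizing the code length.
set.seed(1)
x  <- rbind(matrix(rnorm(150, mean = 0), 50, 3), matrix(rnorm(150, mean = 3), 50, 3))
Ks <- 1:8
Ks[which.min(sapply(Ks, function(K) cluster_code_length(x, K)))]
```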
Model selection in linear regression identifies a subset of the columns of the predictor matrix $X$ associated with the response variable $y$. The regression model is written as $y = X_\gamma \beta_\gamma + \varepsilon$, $\varepsilon \sim N(0, \sigma^2 I)$, where $\beta_\gamma$ is the coefficient vector and the indicator vector $\gamma$ denotes which variables (columns of $X$) are included in the model. The two-stage MDL code length BIC takes the familiar form $\mathrm{BIC} = \frac{n}{2}\log(\mathrm{RSS}_\gamma) + \frac{k_\gamma}{2}\log n$, where $\sum_j 1\{\gamma_j = 1\} = k_\gamma$, and $\mathrm{RSS}_\gamma = \|y - X_\gamma\hat\beta_\gamma\|^2$ is the residual sum of squares with $\hat\beta_\gamma = (X_\gamma^T X_\gamma)^{-1} X_\gamma^T y$. BIC thus penalizes all models of size $k_\gamma$ equally. Hansen and Yu (2001) introduced the gMDL criterion, which includes an adaptive penalty on the model size, that is, a penalty that reflects the goodness-of-fit of the model. gMDL is a mixture MDL code length, derived as follows. For convenience we drop the subscript $\gamma$ from the notation, and denote the variance by $\varphi = \sigma^2$. The mixture code length for the response variable $y$, conditional on $X$, is $\mathrm{CL}(y\mid X) = -\log\int\!\!\int f(y\mid X, \beta, \varphi)\,w(\beta, \varphi)\,d\beta\,d\varphi$. The conjugate family of priors is the Normal-Inverse-Gamma, where we use $\beta\mid\varphi \sim N(0, c\,\varphi\,\Sigma)$ and $w(\varphi) \propto (a/2)^{1/2}\,\Gamma(1/2)^{-1}\,\varphi^{-3/2}\exp\{-a/(2\varphi)\}$. Here, $a$ and $c$ are hyperparameters, and $\Sigma = (X'X)^{-1}$ is the $k \times k$ prior scaled covariance matrix for $\beta\mid\varphi$. The integration with respect to the prior $w(\beta,\varphi)$ is standard. In a strict Bayesian approach, we would place hyperpriors on $a$ and $c$. In the derivation of gMDL, Hansen and Yu take an empirical Bayes approach, and estimate $a$ and $c$ by maximizing the marginal likelihood. Hansen and Yu's gMDL criterion equals
$$\mathrm{gMDL} = \frac{n}{2}\log(\mathrm{RSS}) + \frac{k}{2}\log\frac{R^2/(1-R^2)}{(k/n)/(1-k/n)} + \frac{n}{2}\log\frac{1}{1-k/n} + \log(n)$$
if $R^2 \ge k/n$, and $\frac{n}{2}\log(y^T y) + \frac{1}{2}\log n$ otherwise. Using gMDL, the penalty thus depends on the fit of the model, through the multiple R-squared $R^2$, as well as on the model size. gMDL implicitly compares models based on the coefficient estimates $\hat\beta_{\mathrm{gMDL}}$, where $\hat\beta_{\mathrm{gMDL}} = \max\big(1 - \frac{k/n}{1-k/n}\cdot\frac{1-R^2}{R^2},\,0\big)\,\hat\beta_{\mathrm{LS}}$. gMDL thus generates a form of shrinkage estimates.

Optimal scoring and MDL in classification

Model selection criteria in classification that are based on 0-1 loss are often intractable. It is therefore common to use an $L_2$ loss approximation for selection. Here, we thus take the route of turning classification into regression via optimal scoring. We then appeal to the existing MDL methodology in regression to construct an MDL model selection criterion for classification. We restrict our attention to linear discriminant analysis (LDA) here. Let $X$ be the matrix of feature vectors. In LDA we find discriminant functions $\beta$ such that the linear combinations of feature vectors (projections onto the space spanned by the discriminant functions), $X\beta$, have maximal between-to-within-class sum of squares. LDA is thus an eigenvalue problem. Classification of a new observation is done in the subspace spanned by $\beta$ by assigning the observation to the closest class mean (Euclidean distance) in that subspace. Hastie et al. (1994) discuss the familiar equivalence of LDA and regression via optimal scoring. The point of optimal scoring is to turn a categorical problem into a quantitative one. We find the discriminant functions via linear regression, instead of solving the eigenvalue problem, by forming an $n \times C$ dummy matrix $Y$ and regressing it on $X$. The $j$th column of $Y$ has $i$th row entry "1" if sample $i$ belongs to class $j \in \{1,\dots,C\}$, and "0" otherwise. The simultaneous estimation of the optimal scores $\Theta$ and the regression coefficients $B$ can be stated as $\min_{\Theta,B}\|Y\Theta - XB\|^2$, with constraint $\Theta^T D_p \Theta = I$, where $D_p$ is a diagonal matrix of class proportions. Given $\Theta$, the minimizing $B$ are the least squares coefficients, such that $\min_{\Theta,B}\|Y\Theta - XB\|^2 = \min_\Theta \mathrm{Tr}\{\Theta^T(Y^T Y - Y^T\hat Y)\Theta\}$. From this it follows that a third alternative for finding $\beta$ is to minimize $\|\theta^* - XB\|^2$, i.e., to regress the orthogonal optimal scores $\theta^* = Y\Theta$ on $X$. The regression coefficient estimates $B$ are proportional to the discriminant functions, such that the $l$th discriminant function is $\beta_l = \frac{1}{\sqrt{R_l^2(1-R_l^2)}}B_l$, with $R_l^2 = 1 - \mathrm{RSS}_l(\theta_l^*, X)/\big((\theta_l^*)^T\theta_l^*\big)$, where $R_l^2$ is the multiple R-squared of the $l$th column of $\theta^*$ regressed on $X$. The model selection (selecting columns of $X$) aspect of discriminant analysis is more complex. If the selection criterion is a function of the individual $\mathrm{RSS}_l$, the selection is affected by the choice of initial optimal scores $\theta^*$, which are themselves functions of $X$. It is then recommended to iterate between selection and updating of the optimal scores using the selected columns of $X$.
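The optimal scoring construction summarized above can be sketched in a few lines of R (an illustration, not the authors' code): the class indicator matrix Y is regressed on a centered X (assumed here to have fewer columns than rows, e.g. a small set of genes or cluster centroids), the optimal scores Theta are taken as eigenvectors of Y'Yhat in the D_p metric, and the discriminant directions are recovered from the regression of theta* = Y Theta on X with the 1/sqrt(R_l^2(1-R_l^2)) rescaling.

```r
## Optimal scoring: turn LDA into a regression on a class-indicator matrix.
## A minimal sketch of the equivalence described above; assumes ncol(X) < nrow(X).
optimal_scoring <- function(X, y) {
  X <- scale(as.matrix(X), scale = FALSE)        # center the features
  y <- as.factor(y)
  n <- nrow(X); C <- nlevels(y)
  Y <- model.matrix(~ y - 1)                     # n x C dummy (indicator) matrix
  qX   <- qr(X)
  Yhat <- X %*% qr.coef(qX, Y)                   # fitted values from regressing Y on X
  ## Optimal scores: maximize Tr(Theta' Y'Yhat Theta) subject to Theta' D_p Theta = I,
  ## with D_p = Y'Y / n the diagonal matrix of class proportions.
  Dp <- crossprod(Y) / n
  M  <- crossprod(Y, Yhat) / n
  Wi <- diag(1 / sqrt(diag(Dp)))
  eg <- eigen(Wi %*% M %*% Wi, symmetric = TRUE)
  ## With X centered, the trivial constant score has eigenvalue 0 and is dropped.
  Theta <- Wi %*% eg$vectors[, seq_len(C - 1), drop = FALSE]
  theta <- Y %*% Theta                           # n x (C-1) orthogonal optimal scores
  ## Regress the scores on X and rescale to discriminant functions.
  B  <- qr.coef(qX, theta)
  R2 <- 1 - colSums((theta - X %*% B)^2) / colSums(theta^2)
  beta <- sweep(B, 2, sqrt(R2 * (1 - R2)), "/")  # beta_l = B_l / sqrt(R_l^2 (1 - R_l^2))
  list(scores = theta, discriminants = beta, R2 = R2)
}
```

In the algorithm described later, the initial scores are computed from a small set of high B/W genes; a call such as optimal_scoring(X[, top_genes], y), with top_genes a hypothetical index vector of the S selected genes, would return those scores together with the implied discriminant directions.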
We construct an MDL model selection criterion for classification by applying an extension of gMDL to the multiresponse regression of the orthogonal optimal scores. Since the scores are orthogonal, the closest Gaussian approximation consists of independent response models. For a $C$-class problem, there are $C-1$ non-trivial optimal scores $\theta^*$. We write $\mathrm{gMDL}_C = \sum_{l=1}^{C-1}\mathrm{gMDL}(\theta_l^*\mid X)$, where
$$\mathrm{gMDL}(\theta_l^*\mid X) = \frac{n}{2(C-1)}\log\Big(\frac{1}{C-1}\sum_l \mathrm{RSS}_l\Big) + \frac{k}{2}\log\frac{R_l^2/(1-R_l^2)}{(k/n)/(1-k/n)} + \frac{n}{2}\log\frac{1}{1-k/n} + \log(n)$$
if $R_l^2 \ge k/n$, and $\frac{n}{2}\log\big((\theta_l^*)^T\theta_l^*\big) + \frac{1}{2}\log n$ otherwise. The derivation of $\mathrm{gMDL}_C$ requires a lengthy and delicate discussion of the choice and form of the model priors, and is omitted here to conserve space; the details can be found in Jornsten (2001). The shrinkage estimates of the discriminant functions equal $\hat\beta^{(l)} = \frac{\hat\beta^{(l)}_{\mathrm{LS}}}{\sqrt{R_l^2(1-R_l^2)}}\,1\{R_l^2 \ge k/n\}$, where $\hat\beta^{(l)}_{\mathrm{LS}}$ are the optimal score regression coefficients. For some $l$, the estimates $\hat\beta^{(l)}$ may equal zero, and those discriminant functions are then dropped from the model. Reducing the dimension in which we classify in this fashion often improves the class predictive performance of the selected models. The optimal scoring turned the categorical problem into a quantitative one, but the response variables are nevertheless discrete. A Gaussian coding scheme may not be the most effective, but we hope that the coding redundancy induced by the approximation is independent of the model size and complexity. Then, a relative comparison of the $\mathrm{gMDL}_C$ code lengths between models is still valid. In practice, we find that the Gaussian-based selection criterion performs well on simulated and real data (see the Results section, and Jornsten (2001)).
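The gMDL_C selection criterion quoted above can be transcribed directly into R (a sketch under the reconstruction given here, with intercepts and degenerate fits handled crudely; not the authors' implementation). The function scores a candidate predictor matrix, for example a subset of cluster centroids, against the C-1 orthogonal optimal scores.

```r
## gMDL code length for one optimal score regressed on a candidate predictor
## matrix Xg, following the expression quoted above; rss_bar is the average
## RSS over the C-1 scores, which enters the first term.
gmdl_score <- function(theta_l, Xg, rss_bar, C) {
  n <- length(theta_l); k <- ncol(Xg)
  rss <- sum(qr.resid(qr(Xg), theta_l)^2)
  r2  <- 1 - rss / sum(theta_l^2)
  if (r2 >= k / n) {
    (n / (2 * (C - 1))) * log(rss_bar) +
      (k / 2) * log((r2 / (1 - r2)) / ((k / n) / (1 - k / n))) +
      (n / 2) * log(1 / (1 - k / n)) + log(n)
  } else {
    (n / 2) * log(sum(theta_l^2)) + 0.5 * log(n)   # null-model code length
  }
}

## gMDL_C: sum over the C-1 orthogonal optimal scores (columns of theta).
gmdl_C <- function(theta, Xg) {
  C <- ncol(theta) + 1
  rss_l   <- apply(theta, 2, function(t) sum(qr.resid(qr(Xg), t)^2))
  rss_bar <- mean(rss_l)
  sum(sapply(seq_len(ncol(theta)), function(l) gmdl_score(theta[, l], Xg, rss_bar, C)))
}
```

Candidate centroid subsets could then be compared with calls such as gmdl_C(theta, Xstar[, subset, drop = FALSE]), where theta and Xstar come from the sketches above and subset is a hypothetical index vector; the subset attaining the smallest value is selected.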
Simultaneous clustering and classification

When the predictor variables are correlated, model selection in regression is difficult; the instability of such selection problems is well known. We consider the case where the number of predictors $p$ is very large compared to the sample size $n$. We approach the model selection by clustering the predictors, and selecting cluster centroids for response variable prediction. If a predictor cluster model describes the data well, the cluster centroids are less correlated than the individual predictors, which reduces the instability of selection. When $p \gg n$ and the predictors are highly correlated, using cluster centroids as new predictors is essentially equivalent to model averaging, with significantly improved response predictive performance over individually selected models (Jornsten (2001)). MDL model selection in regression assumes that the predictors are known at the decoder. For the purpose of simultaneous clustering of the predictors and subset selection of cluster centroids for response variable prediction, we now assume that the predictors are not known. In this case, the predictors have to be transmitted to the decoder prior to transmitting the response. We use the MDL clustering code length to encode the predictors. For now, fix the number of predictor clusters at $K$. Given the clustering structure of the predictors, we encode the response using the $\mathrm{gMDL}_C$ regression criterion and optimal scoring, where the explanatory variables in the regression model are now the $K$ cluster centroids. The joint coding of $(\theta^*, X)$ is thus a simultaneous selection of a $K$-cluster model for the predictors $X$, and a selection of cluster centroids in the regression model for $\theta^*$. The joint code length has two parts. The first part, $\mathrm{CL}(X)$, deals with the cluster structure of the predictors. The second part describes the response, given the predictor cluster structure. We call this component $\mathrm{CL}(\theta^*\mid X^*)$, where $X^*$ is the $n \times K$ cluster centroid matrix, $(X^*)_{ik} = \frac{1}{|b(k)|}\sum_{j=1}^{p} 1\{j \in b(k)\}\,x_{ij}$, and $b(k)$ contains the predictors in cluster $k$. The code lengths $\mathrm{CL}(X)$ and $\mathrm{CL}(\theta^*\mid X^*)$ have very different magnitudes. $\mathrm{CL}(X)$ is a code length for the $n \times p$ "observations" $X$. In contrast, only $n \times (C-1)$ observations contribute to $\mathrm{CL}(\theta^*\mid X^*)$. We want the two code lengths to retain the same relative importance if we increase the sample size, or include more predictors in the selection. It is therefore natural to consider the coding rates, and select the model that minimizes
$$\mathrm{CL}(\theta^*, X) = \frac{\mathrm{CL}(X)}{np} + \frac{\mathrm{CL}(\theta^*\mid X^*)}{n(C-1)}.$$
By minimizing the joint coding rate, the gene clustering is affected by the inclusion of a classification model selection criterion.

It is often better to reduce the dimension of $X$ ($n \times p$) prior to clustering. Clustering high-dimensional data is difficult, and we tend to underestimate the number of clusters if we cluster in a higher dimension than necessary. For supervised gene clustering, we have outcome variables $y$ that are categorical. If the within-class variation is sufficiently small, instead of clustering $X$ we cluster the reduced-dimension matrix $\bar X$, $(\bar X)_{cj} = \frac{1}{n_c}\sum_{i=1}^{n} x_{ij}\,1\{y_i = c\}$, $c = 1,\dots,C$, $j = 1,\dots,p$, where $n_c$ is the number of samples in class $c$. If $X$ consists of $p$ "observations" from a $C$-dimensional cluster model, the residuals $r_{ij} = x_{ij} - \bar x_{cj}$ are clearly independent of the clustering, and can be transmitted at fixed cost. This strategy is similar, though motivated differently, to the supervised gene shaving of Hastie et al. (2000a).

We now describe the simultaneous gene clustering and subset selection algorithm, where $K$ is allowed to vary. We perform an initial selection of genes, removing genes with near constant variation across classes. We compute the between-to-within-class sum of squares (B/W), and consider only the $T$ largest B/W genes in our models.

Algorithm

1. Select the $T$ largest B/W genes. Use the $S \le T$ largest B/W genes to compute the initial optimal scores $\theta^*$. Set $q = 0$.
2. Compute $\bar X$, of dimension $C \times p$ ($\bar X$ is the matrix of gene class means), $(\bar X)_{cj} = \frac{1}{n_c}\sum_{i=1}^{n} x_{ij}\,1\{y_i = c\}$.
3. Select an initial number of clusters $K_q$, $q = 0$.
4. Using k-means, generate a $K_q$ clustering of $\bar X$. Compute the clustering code length $\mathrm{CL}(\bar X\mid K_q)$. Denote the clustering assignment by $b_q(k)$, $k = 1,\dots,K_q$.
5. Compute $X^*(K_q)$, of dimension $n \times K_q$, the matrix of cluster centroids $(X^*(K_q))_{ik} = \frac{1}{|b_q(k)|}\sum_{j=1}^{p} x_{ij}\,1\{j \in b_q(k)\}$, $i = 1,\dots,n$, $k = 1,\dots,K_q$. Select the $k^*_q$ centroids that minimize the code length $\mathrm{CL}(\theta^*\mid X^*(K_q))$.
6. Iterate steps 4 and 5 $B(K_q)$ times, and choose the $K_q$ model with the smallest total coding rate $\frac{\mathrm{CL}(\bar X\mid K_q)}{Cp} + \frac{\mathrm{CL}(\theta^*\mid X^*(K_q))}{(C-1)n}$.
7. Set $K_{q+1} = K_q + 1$ and $q = q + 1$. If $q = q_{\mathrm{stop}}$, stop and select the $q_0$ that minimizes $\frac{\mathrm{CL}(\bar X\mid K_q)}{Cp} + \frac{\mathrm{CL}(\theta^*\mid X^*(K_q))}{(C-1)n}$; otherwise go to step 4.
8. If necessary, recompute $\theta^*$ using the selected model, and rerun the algorithm from step 3.
9. The final model is a $K_{q_0}$ gene clustering, the selected $k^*_{q_0}$ cluster centroids, and a set of $C^*_{q_0} \le (C-1)$ discriminant functions.

Here, we perform a direct search over the different cluster sizes $K_q$, and adjust for the different sizes of the model classes for each $K_q$ with the iteration factor $B(K_q)$.
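Putting the pieces together, a skeletal R version of this search over K might look as follows. It reuses cluster_code_length(), gmdl_C() and optimal_scoring() from the sketches above, and, to keep the sketch short, replaces the exhaustive centroid subset search and the B(K_q) restarts of the paper with a single k-means fit per K and a greedy forward selection; the gene columns are assumed to be pre-ordered by B/W so that the first S columns give the initial scores.

```r
## Skeletal simultaneous clustering and subset selection: for each K, cluster the
## gene class means, select active centroids by (greedy) minimization of gMDL_C,
## and score the model by its joint coding rate. A sketch, not the authors' code.
simclust_sketch <- function(X, y, Kmax = 10, S = 10) {
  y <- as.factor(y); C <- nlevels(y); n <- nrow(X); p <- ncol(X)
  Xbar  <- t(apply(X, 2, function(g) tapply(g, y, mean)))  # p x C gene class means (genes as rows for kmeans)
  theta <- optimal_scoring(X[, seq_len(S)], y)$scores      # initial optimal scores
  best  <- NULL
  for (K in 2:Kmax) {
    km    <- kmeans(Xbar, centers = K, nstart = 25)
    cl_x  <- cluster_code_length(Xbar, K, km)              # CL(Xbar | K)
    Xstar <- sapply(seq_len(K), function(k)                # n x K centroid matrix
      rowMeans(X[, km$cluster == k, drop = FALSE]))
    ## Greedy forward selection of active centroids minimizing CL(theta | X*).
    active <- integer(0)
    cl_t   <- gmdl_C(theta, cbind(rep(1, n)))              # intercept-only (null) code length
    repeat {
      cand <- setdiff(seq_len(K), active)
      if (length(cand) == 0) break
      cls <- sapply(cand, function(k)
        gmdl_C(theta, Xstar[, c(active, k), drop = FALSE]))
      if (min(cls) >= cl_t) break
      active <- c(active, cand[which.min(cls)]); cl_t <- min(cls)
    }
    rate <- cl_x / (C * p) + cl_t / ((C - 1) * n)          # joint coding rate
    if (is.null(best) || rate < best$rate)
      best <- list(K = K, active = active, clusters = km$cluster, rate = rate)
  }
  best
}
```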
Validation and Comparison of Methods

To estimate the test error rate of the selected models we use threefold crossvalidation. Each crossvalidation training set is formed by randomly selecting two thirds of the samples. This training set is used for the initial gene selection, model fitting, and selection. The withheld third of the full data set is used only for estimating the test error. We ensure that the crossvalidation training and test sets mimic the full data set by resampling according to the class proportions in the full data set. Threefold crossvalidation is more appropriate than traditional leave-one-out crossvalidation when the number of samples is as small as in these gene expression data sets; leave-one-out crossvalidation with few samples often leads to overfitting and underestimation of the true error rate. In Dudoit et al. (2000), a comparative study of classifiers was conducted. The simple nearest neighbor (NN) and diagonal linear discriminant analysis (DLDA) methods were found to give the best test error rate results on several gene expression data sets (including NCI60 and Acute Leukemia). We implement NN and DLDA here also. DLDA uses all T genes. NN is based on the k nearest neighbors, where k is selected by leave-one-out crossvalidation on the training set. We also implement the supervised gene shaving of Hastie et al. (2000a). This method resembles our approach if the clustering is done separately from the subset selection. We follow Hastie et al. and generate 8 gene shaving clusters, with sizes chosen by the gap statistic. Since the gene shaving clusters are (almost) orthogonal, a simple LDA classifier is used in conjunction with supervised gene shaving.
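The stratified resampling used for these error estimates can be sketched as follows (illustrative; some_classifier and predict_classes below are hypothetical placeholders for whichever classifier is being evaluated).

```r
## Stratified two-thirds / one-third split: training samples are drawn within each
## class so that the class proportions of the full data set are preserved.
## A minimal sketch; the paper repeats such splits 150 times.
stratified_split <- function(y, train_frac = 2/3) {
  y <- as.factor(y)
  train <- unlist(lapply(levels(y), function(cl) {
    idx <- which(y == cl)
    idx[sample.int(length(idx), round(train_frac * length(idx)))]
  }))
  list(train = sort(train), test = setdiff(seq_along(y), train))
}

## Hypothetical usage for estimating a test error rate over 150 resampled splits;
## some_classifier() and predict_classes() stand in for the classifier under study.
## errs <- replicate(150, {
##   s   <- stratified_split(y)
##   fit <- some_classifier(X[s$train, ], y[s$train])
##   mean(predict_classes(fit, X[s$test, ]) != y[s$test])
## })
```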
RESULTS

We apply our algorithm for simultaneous clustering and subset selection for classification to 3 gene expression data sets. For all data sets we perform threefold crossvalidation 150 times to estimate the test error rate of the selected models, and to investigate the stability of the model selection. We also run our algorithm on the full data sets, and discuss the selected models and gene clusters. For each training data set, an initial gene selection removes genes that have near constant variation across samples. For all 3 data sets there is a sharp drop-off in the between-to-within-class (B/W) sum of squares after 10-15 genes. We select the T=200 largest B/W genes as a starting point, and use the S=10 largest B/W genes to compute the initial optimal scores. We run the algorithm with the number of clusters ranging from Kq=1 to 20. For each Kq, we iterate B(Kq) times, where B(Kq)=50 for Kq up to 10, and B(Kq)=100 for Kq=11 to 20. The B(Kq) were selected as a reasonable trade-off between sufficient sampling of the model space and computing time. We allow for up to 20 clusters or individual genes to be selected in the optimal scoring regression model, and perform an exhaustive search over all possible subsets. Model selection may be affected by the choice of initial optimal scores (the S genes). To avoid the computationally intensive exercise of iterating the algorithm, we try to pick a reasonable S such that iteration is not necessary. On the NCI60 data we run the algorithm twice, starting with S=30 genes to compute the initial scores. The selected model is identical to the one we get with S=10, except that one additional cluster of 5 genes is selected for classification. A second iteration of the algorithm, with updated optimal scores, generates the model we get with S=10. If we start with S=40 or S=50 genes, the optimal scores are "noisy" and a large model is selected after the first run of the algorithm. The updated optimal scores are therefore also noisy, and iteration does not help. Reiterating the algorithm with S=10 as a starting point gives back the same model. We conclude that S=10 limits the need for iterating the algorithm.

The simultaneous clustering and subset selection algorithm generates "active" and "inactive" clusters. Active clusters are those that are selected into the classification model. For convenience, in this section we refer to our algorithm as SimClust (simultaneous clustering and subset selection). Separate clustering followed by subset selection is referred to as SepClust, individual gene selection as IndSelect, and gene shaving as GS.

NCI60 data

The National Cancer Institute's anti-cancer drug-screen data (NCI60) of Ross et al. (2000) consist of n=61 samples from human cancer cell lines. Gene expression levels were measured for p=9703 genes. The samples were grouped into 8 classes according to tumor site: 9 breast, 5 central nervous system (CNS), 7 colon, 8 leukemia, 8 melanoma, 9 non-small-cell lung carcinoma (NSCLC), 6 ovarian and 9 renal. Genes that had more than 2 missing values were screened out. The remaining missing values were imputed by the mean value of the 5 nearest neighbor genes. The retained data matrix is thus of dimension 61 x 5244. We run our algorithm on the full data set. The SimClust selected model consists of 8 gene clusters, 6 of which are "active", and 5 discriminant variables (out of 7). The active clusters contain 145 genes, and the cluster sizes range from 4 genes to 47 genes. The training error rate is 11.1%. With SepClust we get 5 gene clusters, 4 of which are active. The cluster sizes range from 12 to 78, and the active gene clusters contain 195 genes. The training error is 19.7%. DLDA and GS also have training error 11.1%, and NN has training error 14.8%. In NN, 2 nearest neighbors were selected. The GS clusters range in size from 2 to 8, containing 40 genes. In Figure 1 we show the difference between the SepClust and SimClust gene clusters. The left plot shows the samples (columns) and gene clusters (rows) generated by SepClust. It is difficult to see clear block patterns in this noisy data set. The inactive cluster is heterogeneous within sample classes, and is dropped in the subset selection. The plot to the right shows the results using SimClust. The active clusters have been split into new active and inactive clusters. The new inactive clusters are noisy and heterogeneous, but the new active clusters are much cleaner than those generated by SepClust. In Figure 2 the crossvalidation test error rates are shown, comparing SimClust, SepClust and IndSelect (top row) to NN, GS, and DLDA. The SimClust test error rates are comparable to NN and DLDA, with the added benefit of gene cluster information. SepClust and GS perform worse, and IndSelect poorly. In Table 1 the average (over crossvalidation training sets) selected models are shown. The models selected by SimClust are on average larger than the ones selected by SepClust, but the SimClust models contain fewer genes. On average SimClust selects 9.4 gene clusters, of which 6.9 are active. Selecting clusters of genes instead of individual genes reduces the instability in model selection. The standard deviations of the SimClust selected number of clusters and active clusters are .6 and .4, respectively. The IndSelect scheme selects on average 9.3 genes, with a standard deviation of 2.0.

Acute Leukemia data, 3 classes

The leukemia data set of Golub et al. (1999) consists of 25 AML (acute myeloid leukemia) and 47 ALL (acute lymphoblastic leukemia) samples. The ALL samples are subdivided into 38 B-cell ALL and 9 T-cell ALL. The gene expressions were measured using high-density oligonucleotide arrays with p=6817 human genes.
After excluding genes with max/min < 5 and max - min < 500 (across samples), the retained data matrix is of dimension 72 x 3571. We run our algorithm on the full data set. SimClust results in 7 gene clusters, 3 of which are active and contain 100 genes. The cluster sizes range from 14 to 51. SepClust generates 5 gene clusters, 3 active, containing 144 genes. The cluster sizes range from 22 to 53. The training error rate is 1.4% with either scheme (one error, sample #67). NN also results in one error, whereas GS and DLDA have a training error of 0. In Figure 3 we compare the clusters generated by SepClust (left plot) and SimClust (right plot). The picture is clearer on the Leukemia data than on NCI60. The active clusters generated by SepClust, though inhomogeneous, consist of genes that are upregulated in one sample class only. The inactive clusters contain genes that are upregulated in multiple sample classes, to varying degrees. In the right plot SimClust has split the active clusters into new homogeneous clusters. The active clusters are now clearly comprised of genes that are upregulated only within a given sample class. The inactive clusters are mixed, consisting of genes that are upregulated to a lesser degree in multiple classes. In Figure 4 the crossvalidation test error rates are shown for the 3-class leukemia data set. SimClust performs as well as NN, but with added gene clustering information. SepClust is on par with DLDA, whereas IndSelect and GS perform worse. In Table 2 the selected models are compared. SimClust generates on average 6.8 gene clusters, 3.0 of them active, consisting of 101.8 genes. The standard deviations of the number of active clusters and the number of clusters are .3 and .7, respectively. SepClust results on average in fewer, but bigger, clusters. IndSelect selects on average 7.3 genes, with standard deviation 1.5.

Acute Leukemia data, 2 classes

We now consider the Acute Leukemia data as a two-class problem, not making any distinction between the ALL-B and ALL-T subclasses. SimClust generates 4 clusters on the full data set, with 2 active clusters consisting of 76 genes. SepClust generates 4 clusters, with 2 active clusters consisting of 71 genes. The simultaneous approach has little impact on this two-class problem; the active clusters contain largely the same set of genes. The training error is again 1.4%, which is also the case for NN and GS. DLDA results in an error rate of 2.8% in this case. In Figure 5 the crossvalidation test error rates are shown. SimClust now outperforms NN and DLDA, shifting the mode of the error rate distribution to 0 as opposed to 1. SepClust is comparable to SimClust, GS performs worse than NN and DLDA, and IndSelect worse still. In the bottom portion of Table 2 the selected models are shown. The SimClust and SepClust methods perform similarly, and are both more stable than IndSelect.

Colon data

Alon et al. (1999) used high-density oligonucleotide arrays to study gene expression patterns in tumor and normal colon tissues. The expression levels of p=6500 genes were measured in n=62 samples, 40 tumor and 22 normal colon tissue samples. The 2000 genes with the highest minimal intensity across samples were selected for further study. Using SimClust on the full data, 4 gene clusters are generated, 3 active, containing 35 genes. The training error is 11.9% (7 errors). SepClust results in 3 clusters, 2 active, containing 193 genes. The training error is 24.2%. NN and DLDA have training errors of 12.9%, whereas GS has a training error of 9.7%.
In Figure 6 the crossvalidation test error rates are shown. SimClust outperforms NN and DLDA, shifting the mode of the error distribution toward zero. SepClust and IndSelect perform worse, with comparable results obtained with GS. In Table 3 the model sizes are shown. SimClust generates on average 4.1 gene clusters, with 2.4 active, consisting of 70.6 genes. SepClust generates fewer, and bigger, clusters. IndSelect selects on average 4.2 genes, with standard deviation 2.2. The standard deviations of the number of clusters and active clusters of SimClust are .6 and .4, respectively, indicating that clustering reduces the instability of model selection.

CONCLUSIONS AND DISCUSSION

We present a new MDL model selection criterion for the simultaneous clustering of genes and subset selection of gene clusters for sample classification. Our method gives competitive or improved test error rate results, compared to some of the best methods reported in the literature, on 3 data sets. In addition, we show that the test error rates and the model selection instability are much reduced by selecting gene clusters instead of individual genes. We show that the gene clusters generated by our algorithm, which includes a classification criterion, are different from clusters generated in a supervised, but separate and directional, manner. The simultaneous clustering and subset selection algorithm generates more homogeneous gene clusters, and separates "active" clusters, which are class predictive, from "inactive" clusters, which exhibit cross-sample variation but are not necessarily good class predictors.

If the sample class labels are unknown, a two-way simultaneous clustering approach is needed. We can extend our method to simultaneous gene and sample clustering. Here, we based the gene clustering on gene expressions in C dimensions, where C is the number of sample classes. When the class labels are unknown, two-way clustering entails the selection of the dimension in which to cluster. The residual data matrix, from the C-dimensional average matrix, is no longer independent of C or the clustering if the class labels are unknown. Thus, simultaneous clustering is obtained via the inclusion of a code length for the residual matrix. We used linear discriminant analysis with our gene clustering method in this paper, but any postprocessing classifier can be applied to the selected models (Jornsten (2001)). On these data sets, the selected models are small and linear discriminant analysis performs well. To make the MDL treatment of simultaneous clustering and classification complete, we need to extend MDL variable selection to classification techniques other than discriminant analysis. This remains an open problem. Two-stage code lengths can be found for penalized discriminant analysis via penalized optimal scoring. However, the derivation of mixture MDL code lengths for generalized linear models, like logistic regression, requires the development of new methodology.

ACKNOWLEDGEMENTS

This research is partially supported by grants from NSF (FD98-02314, DMS-9803063, FD01-12731).
FIGURES

Figure 1: [two heatmap panels, genes (rows) by observations (columns), for the SepClust and SimClust NCI60 clusterings; inactive clusters are labeled]
Figure 2: [histogram panels of crossvalidation error rates (frequency vs. error rate) for Simclust, Sepclust, Indselect, NN, GS, and DLDA200]
Figure 3: [two heatmap panels, genes (rows) by observations (columns), for the SepClust and SimClust Leukemia clusterings; active clusters are labeled]
Figure 4: [histogram panels of crossvalidation error rates for Simclust, Sepclust, Indselect, NN, GS, and DLDA200]
Figure 5: [histogram panels of crossvalidation error rates for Simclust, Sepclust, Indselect, NN, GS, and DLDA200]
Figure 6: [histogram panels of crossvalidation error rates for Simclust, Sepclust, Indselect, NN, GS, and DLDA200]

REFERENCES

Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. 96:6745-6750.

Dudoit, S., Fridlyand, J., Speed, T. (2000) Comparison of discrimination methods for the classification of tumors using gene expression data. Technical report, Department of Statistics, UC Berkeley.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286:531-537.

Jornsten, R. (2001) Data compression and its statistical implications, with an application to the analysis of microarray images. PhD thesis, Department of Statistics, UC Berkeley.

Hansen, M., Yu, B. (2001) Model selection and the principle of minimum description length. JASA 96(454):746-774.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Ross, D., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., Botstein, D. (2000a) Gene shaving: a new class of clustering methods for expression arrays. Technical report, Department of Statistics, Stanford University.

Hastie, T., Tibshirani, R., Botstein, D., Brown, P. (2000b) Supervised harvesting of expression trees. Technical report, Department of Statistics, Stanford University.

Hastie, T., Tibshirani, R., Buja, A. (1994) Flexible discriminant analysis. JASA 89:1255-1270.

Li, H., Hong, F. (2001) Cluster-Rasch model for microarray gene expression data. Genome Biology 2(8):1-13.

Rissanen, J. (1989) Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore.

Ross, D.T., Scherf, U., Eisen, M., Perou, C.M., Spellman, P., Iyer, V., Jeffrey, S.S., Van de Rijn, M., Waltham, M., Pergamenschikov, A., Lee, J.C.F., Lashkari, D., Shalon, D., Myers, T.G., Weinstein, J.N., Botstein, D., Brown, P.O. (2000) Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics 24:227-234.

Wallace, C.S., Boulton, D.M. (1994) Intrinsic classification by MML - the Snob program. Proc. 8th Australian Joint Conference on AI.

FIGURE CAPTIONS

Figure 1: NCI60. Left: separate clustering and subset selection. The inactive cluster is inhomogeneous within sample classes. Right: simultaneous clustering and subset selection. Fewer genes belong to active clusters. The active clusters are more homogeneous than those in the left plot.

Figure 2: NCI60. Crossvalidation test error rates. The simultaneous clustering and subset selection method (Simclust) shows test error rates competitive with NN and DLDA.

Figure 3: Acute Leukemia (3 classes). Left: separate clustering and subset selection. The active clusters (though heterogeneous) consist of genes that are upregulated predominantly in one sample class. Right: simultaneous clustering and subset selection. The active clusters are cleaner than those to the left. They are clearly comprised of genes that are upregulated in one sample class only, whereas inactive clusters consist of genes that are upregulated in multiple classes.

Figure 4: Acute Leukemia (3 classes). Crossvalidation test error rates. The simultaneous clustering and subset selection (Simclust) shows test error rates competitive with NN.

Figure 5: Acute Leukemia (2 classes). Crossvalidation test error rates. The simultaneous clustering and subset selection (Simclust) outperforms NN and DLDA.

Figure 6: Colon (2 classes). Crossvalidation test error rates. The simultaneous clustering and subset selection (Simclust) outperforms NN and DLDA.

TABLES

Table 1: NCI60. Selected models on 150 crossvalidation data sets. Average and (standard deviation) of model sizes. K is the number of clusters; k* is the number of active clusters, the number of individually selected genes, or the number of nearest neighbors, respectively.

NCI60      SimClust    SepClust    IndSelect   NN         GS
k*         6.9 (.4)    3.9 (.9)    9.3 (2.0)   2.2 (.6)   NA
K          9.4 (.6)    4.8 (.8)    NA          NA         NA
# genes    132.1       162.6       9.3         2.2        38.9

Table 2: Leukemia (3 and 2 classes). Selected models on 150 crossvalidation data sets. Average and (standard deviation) of model sizes.

Leukemia3  SimClust    SepClust    IndSelect   NN         GS
k*         3.0 (.3)    3.1 (.5)    7.3 (1.5)   3.4 (2.0)  NA
K          6.8 (.7)    5.0 (.3)    NA          NA         NA
# genes    101.8       141.2       7.3         3.4        21.5

Leukemia2  SimClust    SepClust    IndSelect   NN         GS
k*         2.2 (.1)    2.2 (.3)    4.3 (1.4)   3.2 (1.6)  NA
K          4.0 (.2)    4.0 (0)     NA          NA         NA
# genes    85.9        88.3        4.3         3.2        24.9

Table 3: Colon. Selected models on 150 crossvalidation data sets. Average and (standard deviation) of model sizes.

Colon      SimClust    SepClust    IndSelect   NN         GS
k*         2.4 (.4)    2.1 (.3)    4.2 (2.2)   4.2 (1.8)  NA
K          4.1 (.6)    3.5 (.7)    NA          NA         NA
# genes    70.6        101.6       4.2         4.2        24.2